Generative Oversampling for Mining Imbalanced Datasets

Authors

  • Alexander Liu
  • Joydeep Ghosh
  • Cheryl E. Martin
Abstract

One way to handle data mining problems in which class prior probabilities and/or misclassification costs are highly unequal is to resample the training data until a desired class distribution is achieved. Many resampling techniques have been proposed, and the relationship between resampling and cost-sensitive learning has been well studied. Surprisingly, however, few resampling techniques attempt to create new, artificial data points that generalize the known, labeled data. In this paper, we introduce an easily implementable resampling technique, generative oversampling, which creates new data points by learning from the available training data. Empirically, we demonstrate that generative oversampling outperforms other well-known resampling methods on several datasets in the example domain of text classification.
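
The abstract does not say which generative model is learned from the training data, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes documents are term-count vectors and that the minority class is modeled with a smoothed multinomial; the function name `generative_oversample`, the smoothing parameter `alpha`, and the choice to resample observed document lengths are all illustrative assumptions.

```python
import numpy as np

def generative_oversample(X_min, n_new, alpha=1.0, rng=None):
    """Draw n_new synthetic count vectors from a multinomial fitted to X_min.

    X_min : (n_docs, n_terms) array of term counts for the minority class.
    alpha : additive smoothing so unseen terms keep non-zero probability
            (an assumption of this sketch).
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min)

    # Estimate smoothed term probabilities from the minority-class documents.
    term_totals = X_min.sum(axis=0) + alpha
    theta = term_totals / term_totals.sum()

    # Assumption: synthetic document lengths are resampled from observed lengths.
    lengths = rng.choice(X_min.sum(axis=1), size=n_new, replace=True)

    # Each synthetic document is one multinomial draw with the chosen length.
    out = np.zeros((n_new, X_min.shape[1]), dtype=np.int64)
    for i, n in enumerate(lengths):
        out[i] = rng.multinomial(int(n), theta)
    return out

# Usage: grow a tiny minority class by five synthetic documents.
X_min = np.array([[2, 0, 1],
                  [1, 1, 0]])
X_new = generative_oversample(X_min, n_new=5, rng=0)
X_balanced = np.vstack([X_min, X_new])
```

Unlike duplicating existing minority examples, every sampled vector here is a new point drawn from the fitted distribution, which is the sense in which such points "generalize" the labeled data. To reach a desired class distribution, `n_new` would be chosen as the gap between the target and current minority counts.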

Similar resources

CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Class imbalance classification is a challenging research problem in data mining and machine learning, as most of the real-life datasets are often imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, the minority class instances are representing the concept with greater in...

Oversampling Method for Imbalanced Classification

Classification problem for imbalanced datasets is pervasive in a lot of data mining domains. Imbalanced classification has been a hot topic in the academic community. From data level to algorithm level, a lot of solutions have been proposed to tackle the problems resulted from imbalanced datasets. SMOTE is the most popular data-level method and a lot of derivations based on it are developed to ...

A Study of Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis

The majority of Twitter sentiment analysis systems implicitly assume that the class distribution is balanced while in practice it is usually skewed. We argue that Twitter opinion mining using learning methods should be addressed in the framework of imbalanced learning. In this work, we present a study of synthetic oversampling techniques for tweet-polarity classification. The experiments we con...

An experimental comparison of classification algorithm performances for highly imbalanced datasets

Imbalanced learning data often emerges during the process of the knowledge discovery in data and presents a significant challenge for data mining methods. In this paper we investigate the influence of class imbalanced data on: artificial intelligence methods i.e. neural networks and support vector machine and on classical classification methods represented by RIPPER and Naïve Bayes classifier. ...

Extracting Predictor Variables to Construct Breast Cancer Survivability Model with Class Imbalance Problem

Application of data mining methods as a decision support system has a great benefit to predict survival of new patients. It also has a great potential for health researchers to investigate the relationship between risk factors and cancer survival. But due to the imbalanced nature of datasets associated with breast cancer survival, the accuracy of survival prognosis models is a challenging issue...

Publication date: 2007